Improving Least Squares
Improving Least Squares
Rachel Filderman, Latifa Hasan, Emily Murphy, Shannon Paylor
12/8/2020
Introduction
Have you recently mastered the introductory concepts in producing statistical models? Congratulations! With this knowledge, it’s now time to consider fine tuning model parameters further in order to improve the ordinary least squares (OLS) approach. One way we can do this is by determining if all of the parameters or variables are important in our model or if we instead can produce a better model after reducing the variables in the model through feature selection. You were likely introduced to subset selection in an introductory statistical model course, but there are other methods, as well, to improving OLS with feature selection. We will cover the following techniques in this tutorial with respect to the each method’s motivation and algorithm:
- Subset Selection
- Best Subsets
- Stepwise Selection/Elimination
- Penalized Regression
- Ridge
- Lasso
- Elastic Net
- Dimension Reduction
- Principal Component Analysis (PCA)
- Partial Least Squares (PLS)
In addition, we will demonstrate these techniques through a coding example with the following dataset.
Before we dive into dimension reduction, let’s refresh a few key concepts.
Review:
Supervised vs Unsupervised Learning: supervised learning involves modeling a response variable on one or more features, while unsupervised learning involves finding underlying structure in data features without relating to a predictor variable.
Response Variable: the variable of interest in a prediction problem, usually denoted \(Y\).
Features/predictor Variables: data used in modeling, usually denoted \(X\). In a supervised learning problem, these features are used to predict the response variable.
Standardizing Variables: re-scaling variables to have mean zero and variance one. Standardizing can be especially important when variables are on widely differing scales (e.g. one feature on a percentage scale with values between zero and one and another feature measured in dollars on a scale from zero to several thousand).
Bias & Variance: bias is a measure of how far the mean/expected value of an estimator is from the true value, or \(E(\hat Y) - Y\). Variance is a measure of how much the squared expected value of an estimator differs from the expected value of the squared estimator, or \(E(\hat Y^2) - E(\hat Y)^2\). In other words, bias measures the error in the center of the estimate, and variance measures the error in the spread of the estimate. Generally, as model complexity increases, bias increases and variance decreases. This is frequently referred to as the bias-variance tradeoff.
Model Selection Criteria: Model selection criteria are used to evaluate and compare subset regression models. Common criteria used include:
- Adjusted \(R^2\) = \(1-(\frac{n-1}{n-p})(1-R^2_p)\)
- Akaike Information Criterion (AIC) = $ -2ln(L)+2p$, where L is the likelihood function for a specific model
- Bayesian Information Criterion (BIC) = \(-2ln(L)+p*ln(n)\)
- Mean Square Error (MSE) = \(\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2\)
Why improve on least squares?
OLS is not well-suited to high dimensional data. When the number of predictors \(p\) is greater than the number of observations \(n\), it will be possible to create a model explaining 100% of the variance in the data, which will usually lead to overfitting and poor prediction for new data.
Real Life Applications:
- DNA Sequencing
- GDP
- Housing prices
Subset Selection
One method of improving the ordinary least squares regression is by selecting a subset of the predictor variables to train the model based on how the model performs. Having fewer predictor variables improves the ability to interpret the model because it is easier to view the relationship between the response variable and the predictors. Fewer variables would also decrease the risk of overfitting the model on the training data. The predictor variables are selected based on their performance on OLS, and only the predictors that improve performance are used. The two main processes for choosing which variables to include are best subsets and stepwise regression.
Best Subsets
One method of finding the best subsets is by testing all possible combinations of these predictors to determine which model performs the best, and the combination of predictors that performs the best is selected to be used. Best subsets returns the best model for each number of predictors included, so the best model with one predictor, with two predictors, and so on. A model is typically defined as performing better if it has the minimum MSE, or residual sum of squares, values, but any of the model selection criteria reviewed above may be used. The different model selection criteria may select different combinations of predictors as performing the “best”, so it is valuable to consider multiple options. While this process is simple to implement, the time to calculate all the combinations exponentially increases as the number of potential predictors increases. For example, a model with 10 predictors has 1024 possible combinations. Therefore, this may only be possible if the data includes a small number of variables. It is possible to arbitrarily select the maximum number of predictors to include, but the time to calculate the best predictors to use may still take too long to work as a viable solution. The user must also review the best models for each number of predictions and choose the best one based on the results and context of the problem.
Stepwise-type Procedure
Stepwise-type procedures determine which predictor variables to include by adding or deleting predictors one at a time; the process evaluates fewer models and is less computationally burdensome than comparing all possible subsets. Three categories of stepwise-type procedures include forward selection, backward elimination, and stepwise regression (which is a combination of the previous two).
Forward Selection: The initial model includes only the intercept value and no predictors. Predictors will be added to the model one at a time, as long as the F statistic is greater than a preselected F value called \(F_{IN}\). Most programs utilize a default \(F_{IN}\), so this does not need to be set in order to perform the calculation. The first predictor added to the model is the one with the largest correlation with the response variable; this predictor also generates the most significant F statistic. The F statistic must exceed \(F_{IN}\) to be added to the model. For the second step, the next predictor selected will have the highest correlation with the response variable, given the presence of the first predictor. The partial F (or t) statistic for the second predictor must also exceed \(F_{IN}\). If \(F_{IN}\) is larger, then the predictor is not added to the model and no further predictors are considered to be added to the model. Step 2 is repeated until the partial F statistic of the predictor to be added is less than \(F_{IN}\) or until all predictors have been added to the model.
Backward Elimination: The initial model for backward elimination begins with all of the predictor variables included in the model and drops them one by one until a suitable model is found. This method is preferred if you want to confirm all predictors have been considered and nothing was missed. Backward elimination determines if predictors should be dropped by comparing the partial F (or t) statistic to a preselected value \(F_{OUT}\). Similar to \(F_{IN}\), programs will use a default value for \(F_{OUT}\). The first step of implementing backward elimination is to calculate the partial F statistic for all the predictors. The smallest partial F statistic is compared to \(F_{OUT}\) and if it is smaller than \(F_{OUT}\), then that predictor is removed from the model. The partial F statistics are recalculated given the updated model, and the smallest partial F statistic for the new model is compared to \(F_{OUT}\). The process of removing predictors will continue until the smallest partial F statistic exceeds \(F_{OUT}\).
Stepwise Regression: Stepwise regression is a combination of forward selection and backward elimination. The initial model assumes only the intercept is included, and predictors are added one at a time. The difference is after a predictor is added, all the predictors in the model will be reviewed to determine if they should still be included. In this method, both \(F_{IN}\) and \(F_{OUT}\) must be preselected. There is no required relationship between \(F_{IN}\) and \(F_{OUT}\). Some people prefer to set the two values equal to each other; others would rather set \(F_{IN}\) > \(F_{OUT}\) so it is more difficult to add a predictor than drop one.
The initial model is the same as forward selection, and each predictor added must meet the same criteria as outlined in forward selection. After adding a new predictor, the partial F statistic is calculated for each of the predictors in the model, and the partial F statistic is compared to \(F_{OUT}\). One important difference is all partial F statistics less than \(F_{OUT}\) will be dropped, while only the smallest partial F statistic is dropped in backward elimination. The model will be complete when the partial F statistic for adding a predictor is less than \(F_{IN}\).
Code Implementation
We will show how to implement the various methods using an online news popularity data set. The data set includes 58 predictors and 39644 rows of observations. Examples of the predictors include number of words in the title, number of words in the article, a binary value for each day of the week, and average polarity of the words used. The two non-predictive columns we removed are url and timedelta, which is the time between the article being published and being added to the dataset. The goal is to determine which of the 58 predictors are most useful in using linear regression to find the number of times an article is shared.
The code below shows how to perform all three of the stepwise procedures. The output in R typically shows detailed steps of the different models, but only the summaries have been included here.
# Load the data
news = read.csv('OnlineNewsPopularity.csv') %>% select(-url, -timedelta)
# Define intercept only model
regnull <- lm(shares~1, data=news)
# Define model with all predictors
regfull <- lm(shares~., data=news)
# Perform Forward Selection, start with regnull
forward_selection = step(regnull, scope=list(lower=regnull, upper=regfull), direction="forward")
# Perform Backward Elimination, start with regfull
backward_elimination = step(regfull, scope=list(lower=regnull, upper=regfull), direction="backward")
# Perform Stepwise Regression, start with regnull
stepwise_regression = step(regnull, scope=list(lower=regnull, upper=regfull), direction="both")
# Results from Forward Selection
summary(forward_selection)
##
## Call:
## lm(formula = shares ~ kw_avg_avg + self_reference_min_shares +
## kw_max_avg + kw_min_avg + num_hrefs + kw_max_max + avg_negative_polarity +
## data_channel_is_entertainment + LDA_03 + average_token_length +
## global_subjectivity + n_tokens_title + weekday_is_monday +
## kw_min_max + LDA_02 + weekday_is_saturday + num_self_hrefs +
## n_tokens_content + self_reference_max_shares + kw_max_min +
## data_channel_is_lifestyle + num_keywords + global_rate_positive_words +
## min_positive_polarity + abs_title_sentiment_polarity + abs_title_subjectivity +
## num_imgs + kw_min_min, data = news)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34610 -2243 -1232 -127 837695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.414e+03 7.193e+02 -1.965 0.049368 *
## kw_avg_avg 1.646e+00 1.268e-01 12.979 < 2e-16 ***
## self_reference_min_shares 2.261e-02 3.366e-03 6.718 1.87e-11 ***
## kw_max_avg -2.001e-01 2.354e-02 -8.502 < 2e-16 ***
## kw_min_avg -3.600e-01 7.351e-02 -4.897 9.77e-07 ***
## num_hrefs 2.719e+01 6.408e+00 4.243 2.21e-05 ***
## kw_max_max -6.776e-04 5.377e-04 -1.260 0.207663
## avg_negative_polarity -1.485e+03 5.215e+02 -2.848 0.004406 **
## data_channel_is_entertainment -9.108e+02 1.643e+02 -5.544 2.98e-08 ***
## LDA_03 5.475e+02 2.595e+02 2.109 0.034908 *
## average_token_length -2.908e+02 9.627e+01 -3.020 0.002526 **
## global_subjectivity 2.441e+03 7.549e+02 3.234 0.001223 **
## n_tokens_title 8.965e+01 2.833e+01 3.164 0.001555 **
## weekday_is_monday 4.636e+02 1.557e+02 2.977 0.002917 **
## kw_min_max -2.398e-03 1.092e-03 -2.197 0.028047 *
## LDA_02 -8.297e+02 2.468e+02 -3.362 0.000776 ***
## weekday_is_saturday 5.886e+02 2.424e+02 2.428 0.015181 *
## num_self_hrefs -5.295e+01 1.725e+01 -3.069 0.002149 **
## n_tokens_content 1.934e-01 1.537e-01 1.259 0.208126
## self_reference_max_shares 3.641e-03 1.660e-03 2.193 0.028317 *
## kw_max_min 3.506e-02 1.890e-02 1.855 0.063632 .
## data_channel_is_lifestyle -5.460e+02 2.684e+02 -2.034 0.041951 *
## num_keywords 6.333e+01 3.356e+01 1.887 0.059158 .
## global_rate_positive_words -7.843e+03 4.160e+03 -1.886 0.059366 .
## min_positive_polarity -1.848e+03 9.263e+02 -1.995 0.046006 *
## abs_title_sentiment_polarity 5.840e+02 2.841e+02 2.055 0.039849 *
## abs_title_subjectivity 6.144e+02 3.419e+02 1.797 0.072362 .
## num_imgs 1.153e+01 7.999e+00 1.442 0.149423
## kw_min_min 2.287e+00 1.613e+00 1.418 0.156278
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11500 on 39615 degrees of freedom
## Multiple R-squared: 0.02257, Adjusted R-squared: 0.02188
## F-statistic: 32.67 on 28 and 39615 DF, p-value: < 2.2e-16
# Results from Backward Elimination
summary(backward_elimination)
##
## Call:
## lm(formula = shares ~ n_tokens_title + n_tokens_content + n_unique_tokens +
## n_non_stop_words + num_hrefs + num_self_hrefs + num_imgs +
## average_token_length + num_keywords + data_channel_is_lifestyle +
## data_channel_is_entertainment + data_channel_is_bus + data_channel_is_socmed +
## data_channel_is_tech + data_channel_is_world + kw_min_min +
## kw_max_min + kw_min_max + kw_min_avg + kw_max_avg + kw_avg_avg +
## self_reference_min_shares + self_reference_max_shares + weekday_is_monday +
## weekday_is_saturday + LDA_00 + LDA_03 + LDA_04 + global_subjectivity +
## global_rate_positive_words + min_positive_polarity + avg_negative_polarity +
## abs_title_subjectivity + abs_title_sentiment_polarity, data = news)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33664 -2239 -1225 -125 837641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.792e+03 6.872e+02 -2.608 0.009113 **
## n_tokens_title 9.059e+01 2.838e+01 3.192 0.001413 **
## n_tokens_content 5.146e-01 2.027e-01 2.539 0.011135 *
## n_unique_tokens 2.197e+03 9.173e+02 2.395 0.016626 *
## n_non_stop_words -1.474e+03 6.175e+02 -2.386 0.017023 *
## num_hrefs 2.630e+01 6.440e+00 4.083 4.46e-05 ***
## num_self_hrefs -5.313e+01 1.750e+01 -3.036 0.002402 **
## num_imgs 1.441e+01 8.198e+00 1.757 0.078856 .
## average_token_length -2.689e+02 9.748e+01 -2.759 0.005800 **
## num_keywords 6.037e+01 3.441e+01 1.754 0.079357 .
## data_channel_is_lifestyle -1.113e+03 3.774e+02 -2.948 0.003199 **
## data_channel_is_entertainment -1.114e+03 2.430e+02 -4.584 4.58e-06 ***
## data_channel_is_bus -9.359e+02 3.701e+02 -2.529 0.011446 *
## data_channel_is_socmed -6.709e+02 3.510e+02 -1.911 0.055974 .
## data_channel_is_tech -6.443e+02 3.520e+02 -1.830 0.067183 .
## data_channel_is_world -7.234e+02 3.006e+02 -2.407 0.016092 *
## kw_min_min 3.825e+00 8.895e-01 4.300 1.72e-05 ***
## kw_max_min 3.567e-02 1.891e-02 1.886 0.059313 .
## kw_min_max -2.346e-03 1.093e-03 -2.147 0.031806 *
## kw_min_avg -3.462e-01 7.405e-02 -4.676 2.94e-06 ***
## kw_max_avg -1.893e-01 2.381e-02 -7.952 1.88e-15 ***
## kw_avg_avg 1.566e+00 1.293e-01 12.112 < 2e-16 ***
## self_reference_min_shares 2.255e-02 3.366e-03 6.698 2.15e-11 ***
## self_reference_max_shares 3.642e-03 1.661e-03 2.193 0.028322 *
## weekday_is_monday 4.741e+02 1.557e+02 3.044 0.002334 **
## weekday_is_saturday 5.908e+02 2.426e+02 2.436 0.014870 *
## LDA_00 1.093e+03 4.435e+02 2.464 0.013751 *
## LDA_03 5.936e+02 3.245e+02 1.829 0.067412 .
## LDA_04 7.150e+02 4.024e+02 1.777 0.075603 .
## global_subjectivity 2.550e+03 7.603e+02 3.355 0.000796 ***
## global_rate_positive_words -8.893e+03 4.210e+03 -2.112 0.034669 *
## min_positive_polarity -2.256e+03 9.466e+02 -2.383 0.017187 *
## avg_negative_polarity -1.444e+03 5.247e+02 -2.752 0.005932 **
## abs_title_subjectivity 6.309e+02 3.421e+02 1.844 0.065150 .
## abs_title_sentiment_polarity 6.003e+02 2.842e+02 2.112 0.034675 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11500 on 39609 degrees of freedom
## Multiple R-squared: 0.02279, Adjusted R-squared: 0.02195
## F-statistic: 27.17 on 34 and 39609 DF, p-value: < 2.2e-16
# Results from Stepwise Regression
summary(stepwise_regression)
##
## Call:
## lm(formula = shares ~ kw_avg_avg + self_reference_min_shares +
## kw_max_avg + kw_min_avg + num_hrefs + avg_negative_polarity +
## data_channel_is_entertainment + LDA_03 + average_token_length +
## global_subjectivity + n_tokens_title + weekday_is_monday +
## kw_min_max + LDA_02 + weekday_is_saturday + num_self_hrefs +
## self_reference_max_shares + kw_max_min + data_channel_is_lifestyle +
## num_keywords + global_rate_positive_words + min_positive_polarity +
## abs_title_sentiment_polarity + abs_title_subjectivity + num_imgs +
## kw_min_min, data = news)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34139 -2236 -1237 -130 837744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.836e+03 6.182e+02 -2.970 0.002980 **
## kw_avg_avg 1.605e+00 1.243e-01 12.918 < 2e-16 ***
## self_reference_min_shares 2.253e-02 3.365e-03 6.695 2.19e-11 ***
## kw_max_avg -1.950e-01 2.331e-02 -8.363 < 2e-16 ***
## kw_min_avg -3.528e-01 7.339e-02 -4.807 1.54e-06 ***
## num_hrefs 2.938e+01 6.148e+00 4.779 1.77e-06 ***
## avg_negative_polarity -1.524e+03 5.206e+02 -2.928 0.003413 **
## data_channel_is_entertainment -9.021e+02 1.637e+02 -5.511 3.59e-08 ***
## LDA_03 5.338e+02 2.555e+02 2.089 0.036700 *
## average_token_length -2.917e+02 9.625e+01 -3.031 0.002440 **
## global_subjectivity 2.490e+03 7.542e+02 3.301 0.000963 ***
## n_tokens_title 8.926e+01 2.827e+01 3.158 0.001592 **
## weekday_is_monday 4.640e+02 1.557e+02 2.979 0.002890 **
## kw_min_max -2.361e-03 1.092e-03 -2.163 0.030528 *
## LDA_02 -8.253e+02 2.458e+02 -3.357 0.000789 ***
## weekday_is_saturday 5.884e+02 2.424e+02 2.427 0.015210 *
## num_self_hrefs -5.039e+01 1.716e+01 -2.937 0.003319 **
## self_reference_max_shares 3.645e-03 1.660e-03 2.196 0.028128 *
## kw_max_min 3.580e-02 1.890e-02 1.895 0.058146 .
## data_channel_is_lifestyle -5.219e+02 2.681e+02 -1.947 0.051548 .
## num_keywords 6.242e+01 3.354e+01 1.861 0.062774 .
## global_rate_positive_words -7.299e+03 4.147e+03 -1.760 0.078396 .
## min_positive_polarity -2.097e+03 9.018e+02 -2.325 0.020071 *
## abs_title_sentiment_polarity 5.902e+02 2.841e+02 2.078 0.037761 *
## abs_title_subjectivity 6.279e+02 3.418e+02 1.837 0.066231 .
## num_imgs 1.429e+01 7.681e+00 1.861 0.062787 .
## kw_min_min 3.932e+00 8.828e-01 4.454 8.46e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11500 on 39617 degrees of freedom
## Multiple R-squared: 0.0225, Adjusted R-squared: 0.02185
## F-statistic: 35.07 on 26 and 39617 DF, p-value: < 2.2e-16
Results
As seen above, the three stepwise-type procedures do not give the same results as each other. Forward selection and stepwise regression have the same output in this scenario, but it is not guaranteed the results from forward selection and stepwise regression would be equal for a different model. Backward elimination includes 2 more predictor variables than the other two methods, which results in a higher adjusted \(R^2\) value. The residual standard errors are the same, and this shows how different combinations of predictors can have similar performances. It is valuable to perform all three of these processes and pick a model based on the context of the problem.
Pros & Cons
Pros
Subset selection is easy to implement and may work with categorical and continuous predictors.
The coefficients assigned to the predictors selected using subset selection are easy to interpret, which is useful for inference problems.
The three stepwise-type procedures are able to be used when there are many predictors.
Cons
When there are a large number of predictors possible, the time to run the best subset formula can become restrictive.
Results of the best subset and stepwise-type procedure may vary based on the model selection parameters used.
If the number of observations is small, then a small change in the data may result in a significant change to the model.
These methods only work for supervised learning problems.
Penalized Regression
The subset method focuses on using least squares to improve on regression using an “all or nothing” approach. The results from the subset selection methods can vary based on the model used, and the methods in subset selection are not often robust to data of all sizes. rendering the subset selection method unstable. Shrinkage methods are useful in cases where subset selection methods are unstable. Shrinkage methods shrink the coefficients closer to 0 to help improve model fit and predictive power by reducing the variance of the coefficient estimates generated. Shrinkage accomplishes this by using a penalized regression method, where a penalty is imposed for having many variables in the model. Penalized regression methods are available for linear and logistic regression depending on the type of outcome we are trying to predict.
There are three well known methods for penalized regression: ridge, lasso, and elastic net. Each method imposes the penalty by using a tuning parameter called \(\lambda\), the \(\lambda\) value is usually selected using cross-validation. The minimum \(\lambda\) value generated is the best tuning parameter used in each of the models. The primary difference in the three methods is in their choice of \(\alpha\). Ridge regression uses an \(\alpha\) value of 0 and lasso regression uses an \(\alpha\) value of 1. Elastic net regression provides the flexibility to choose an \(\alpha\) value between 0 and 1.
Code Implementation
The code below is a demonstration of implementing penalized regression using ridge, lasso, and elastic net regression methods. Since penalized regression aims to improve predictive power of regression models, we determine the predictive performance of each of the three regression methods. In order to generate predictions, we begin by splitting our dataset into train and test data. The train data will be used to build the models, whereas the predictions will be made on the test data. Two parameters, RMSE, Root Mean Squared Error, and \(R^2\) will help to determine the performance of each of the three models.The glmnet package in R is very efficient for fitting ridge, lasso, and elastic net models, for both penalized linear and logistic regression models.
Import the libraries
library(tidyverse) #for data visualization and manipulation
library(MASS) #for ridge regression
library(broom)
library(glmnet) #for ridge, lasso, and elastic net regression
library(caret) #to partition data into train and test sets
Exploratory analysis on the data
The summary of the dependent variable shows that it has a wide range of values ranging from 1 - 843,300, with a mean of 3395.
#Import the dataset
news = read.csv("OnlineNewsPopularity.csv")
#Drop non-predictive variables from the dataset
news = subset(news, select = -c(url,timedelta))
summary(news$shares)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 946 1400 3395 2800 843300
#Split the data into train and test data
set.seed(123)
split = sort(sample(nrow(news), nrow(news)*.8)) #80% train samples
train<-news[split,]
test<-news[-split,]
We will compute the penalized linear regression on the train data set.
#Specify the predictor and response variables
#y is the response variable --> shares
y.train <- train$shares
#x is the matrix with the predictor variables
x.train <- model.matrix(train$shares~., train)
#Remove the intercept column that X.train generates intercept in the first column
x.train = x.train[,-1]
Ridge Regression
The set up of a ridge regression model begins by setting up the 10-fold cross-validation model to determine the tuning parameter, \(\lambda\). The cross-validation model is set up by using the glmnet package in R. The parameters needed to compute the \(\lambda\) value include \(\alpha\) = 1, and the number of cross-validation folds. Ridge regression has a lower mean square prediction error compared to least squares. When \(\lambda\) is 0, the variance is the same as that computed by least squares. As \(\lambda\) increases, the variance decreases, when \(\lambda\) approaches \(\infty\), the variance approaches closer to 0.
In order to compute ridge regression, we set the \(\alpha\) value to 0. The glmnet package fits the model and computes the tuning parameter, \(\lambda\).
# Find the best lambda using cross-validation
set.seed(202)
cv_ridge <- cv.glmnet(x.train, y.train, alpha = 0)
# Display the best lambda value
cv_ridge$lambda.min
## [1] 1196.618
# Fit the ridge regression model on the training data
ridge <- glmnet(x.train, y.train, alpha = 0, lambda = cv_ridge$lambda.min)
# Display regression coefficients
coef(ridge)
## 59 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 5.315400e+02
## n_tokens_title 6.447372e+01
## n_tokens_content -1.233582e-01
## n_unique_tokens 3.095086e+00
## n_non_stop_words 1.136617e-01
## n_non_stop_unique_tokens 3.324189e+00
## num_hrefs 3.110251e+01
## num_self_hrefs -5.226171e+01
## num_imgs 2.120434e+01
## num_videos 2.009726e+01
## average_token_length -2.146994e+02
## num_keywords 3.525640e+01
## data_channel_is_lifestyle -5.043507e+02
## data_channel_is_entertainment -1.121642e+03
## data_channel_is_bus -6.985487e+02
## data_channel_is_socmed -2.927150e+02
## data_channel_is_tech -4.789197e+02
## data_channel_is_world -4.646598e+02
## kw_min_min 2.416062e+00
## kw_max_min -1.389527e-02
## kw_avg_min -6.159044e-02
## kw_min_max -2.003396e-03
## kw_max_max -3.021239e-04
## kw_avg_max 6.100977e-04
## kw_min_avg -7.823197e-02
## kw_max_avg -4.934097e-02
## kw_avg_avg 7.474319e-01
## self_reference_min_shares 7.636801e-03
## self_reference_max_shares 2.470708e-03
## self_reference_avg_sharess 1.821307e-03
## weekday_is_monday 3.037723e+02
## weekday_is_tuesday -1.951195e+02
## weekday_is_wednesday -3.873783e+01
## weekday_is_thursday -1.127246e+02
## weekday_is_friday -7.796023e+01
## weekday_is_saturday 3.776961e+02
## weekday_is_sunday 1.242207e+01
## is_weekend 1.794889e+02
## LDA_00 2.807948e+02
## LDA_01 -3.192547e+02
## LDA_02 -9.001340e+02
## LDA_03 7.429552e+02
## LDA_04 2.160582e+01
## global_subjectivity 2.548915e+03
## global_sentiment_polarity -2.463874e+02
## global_rate_positive_words -2.380978e+03
## global_rate_negative_words 7.879721e+02
## rate_positive_words -1.522252e+02
## rate_negative_words -3.742291e+02
## avg_positive_polarity -7.207101e+02
## min_positive_polarity -1.169008e+03
## max_positive_polarity 9.704761e+01
## avg_negative_polarity -2.034620e+02
## min_negative_polarity -3.017380e+02
## max_negative_polarity -8.356904e+02
## title_subjectivity 4.435000e+01
## title_sentiment_polarity 1.168851e+02
## abs_title_subjectivity 8.025450e+02
## abs_title_sentiment_polarity 6.942200e+02
Next, we will use the ridge regression model constructed with the train data to make predictions on the test data.
# Make predictions on the test data
x.test <- model.matrix(test$shares ~., test)#[,-1]
#Drop the first column from the x.test matrix which is the intercept
x.test <- x.test[,-1]
predictions_ridge <- ridge %>% predict(x.test)
# Model performance metrics
RMSE_ridge = RMSE(predictions_ridge, test$shares)
Rsquare_ridge = R2(predictions_ridge, test$shares)
data.frame(RMSE = RMSE_ridge, R2 = Rsquare_ridge)
## RMSE s0
## 1 14274.39 0.01783234
Pros
- The ridge penalty decreases the variance of the data. A decreased variance allows for better model performance. However, decrease in variance increases the bias. The bias-variance trade off allows for improved model performance.
- The computational efficiency of ridge regression is higher compared to subset selection.
Cons 1. The ridge penalty increases bias in the model. 2. The ridge penalty includes all predictors in the model, unlike subset selection methods where a subset of the predictor variables are considered, not giving the user a chance to create a model using a subset of variables. 3. The ability to interpret from models produced using a ridge penalty is not very high.
Lasso Regression
The set up of a lasso penalty is very similar to that of the ridge penalty. We set up a 10-fold cross-validation model to determine the tuning parameter \(\lambda\). Similar to ridge regression, we use the glmnet package in R to compute the \(\lambda\) value. As mentioned earlier, the key difference between ridge and lasso is the choice of \(\alpha\). An \(\alpha\) value of 1 is used to construct a lasso model along with the specified number of cross-validation folds. One advantage of lasso over ridge is that lasso can shrink coefficients completely to 0, unlike the ridge penalty that shrinks coefficients close to 0. Shrinking coefficients down to 0 removes unhelpful predictor variables from the model. This is similar to the subset selection method where only a subset of the predictor variables are considered.
In order to compute lasso regression, we set the \(\alpha\) value to 1. The glmnet package fits the model and computes the tuning parameter, \(\lambda\).
# Find the best lambda using cross-validation
set.seed(202)
cv_lasso <- cv.glmnet(x.train, y.train, alpha = 1)
# Display the best lambda value
cv_lasso$lambda.min
## [1] 28.29391
# Fit the ridge regression model on the training data
lasso <- glmnet(x.train, y.train, alpha = 1, lambda = cv_lasso$lambda.min)
# Display regression coefficients
coef(lasso)
## 59 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -6.020172e+02
## n_tokens_title 6.122662e+01
## n_tokens_content .
## n_unique_tokens 7.075435e-02
## n_non_stop_words .
## n_non_stop_unique_tokens .
## num_hrefs 3.094027e+01
## num_self_hrefs -4.683322e+01
## num_imgs 1.641107e+01
## num_videos 1.457825e+01
## average_token_length -2.242244e+02
## num_keywords 7.569121e+00
## data_channel_is_lifestyle -1.607200e+02
## data_channel_is_entertainment -9.394971e+02
## data_channel_is_bus -1.611592e+02
## data_channel_is_socmed .
## data_channel_is_tech -5.117892e+01
## data_channel_is_world .
## kw_min_min 2.681926e+00
## kw_max_min .
## kw_avg_min -9.819949e-02
## kw_min_max -1.734523e-03
## kw_max_max -3.127683e-04
## kw_avg_max .
## kw_min_avg -2.493154e-01
## kw_max_avg -1.341927e-01
## kw_avg_avg 1.288923e+00
## self_reference_min_shares 8.446495e-03
## self_reference_max_shares 2.621438e-03
## self_reference_avg_sharess .
## weekday_is_monday 3.422202e+02
## weekday_is_tuesday -8.946535e+01
## weekday_is_wednesday .
## weekday_is_thursday .
## weekday_is_friday .
## weekday_is_saturday 3.261783e+02
## weekday_is_sunday .
## is_weekend 1.982845e+02
## LDA_00 3.424033e+01
## LDA_01 -4.344960e+01
## LDA_02 -7.586311e+02
## LDA_03 7.852135e+02
## LDA_04 .
## global_subjectivity 2.263121e+03
## global_sentiment_polarity .
## global_rate_positive_words .
## global_rate_negative_words .
## rate_positive_words .
## rate_negative_words .
## avg_positive_polarity -4.838425e+02
## min_positive_polarity -9.829018e+02
## max_positive_polarity .
## avg_negative_polarity -3.833825e+02
## min_negative_polarity -1.155690e+02
## max_negative_polarity -5.681276e+02
## title_subjectivity .
## title_sentiment_polarity .
## abs_title_subjectivity 6.493182e+02
## abs_title_sentiment_polarity 6.784563e+02
Next, we will use the lasso regression model constructed with the train data to make predictions on the test data.
# Make predictions on the test data
predictions_lasso <- lasso %>% predict(x.test)
# Model performance metrics
RMSE_lasso = RMSE(predictions_lasso, test$shares)
Rsquare_lasso = R2(predictions_lasso, test$shares)
data.frame(RMSE = RMSE_lasso, R2 = Rsquare_lasso)
## RMSE s0
## 1 14268.18 0.01866362
Pros
The lasso penalty can remove unhelpful predictors from the model by setting their coefficients to 0, leaving only useful predictors to improve predictive power.
Models generated using the lasso penalty have better interpretability compared to the ridge penalty.
Lasso regression is also much more computationally efficient compared to the subset selection method.
Lasso regression, like ridge regression, has a bias-variance trade-off, and decreases variance, thus improving model performance.
Cons
- Since the lasso penalty removes predictors, it increases bias by only incorporating certain predictors.
From the description and results above, we can see that the both ridge and lasso penalties have slight differences, but have similar performance, and one does not outperform the other in all aspects.
Elastic Net Regression
The elastic net regression penalty is slightly different than the lasso and ridge penalties. Here the \(\alpha\) value can be between 0 and 1. A random \(\alpha\) value between 0 and 1 can be chosen, or the caret package in R can help automatically select the best \(\alpha\) and \(\lambda\) parameters. A combination of the best \(\alpha\) and \(\lambda\) minimizes the error rate.
Pros
- Elastic net avoids hard coding the \(\alpha\) value and automatically selects the best combination of \(\alpha\) and \(\lambda\) values to fit the model. The RMSE and \(R^2\) values are the lowest for the elastic net model, indicating that it performs better than the lasso and ridge regression models, since it results in the least prediction error.
Cons
- The elastic net model is computationally inefficient in comparison with lasso and ridge regression.
Elastic net can be computed using two methods. The first method will automatically compute the best alpha and lambda values.
#Define the model and find the best lambda using 10-fold cross validation
set.seed(202)
elasticnet <- train(shares ~., data = train, method = "glmnet", trControl = trainControl("cv", number = 10), tuneLength = 10)
#Compute coefficients for the model
coef(elasticnet$finalModel, elasticnet$bestTune$lambda)
## 59 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -6.086767e+02
## n_tokens_title 6.096168e+01
## n_tokens_content .
## n_unique_tokens 9.864408e-02
## n_non_stop_words .
## n_non_stop_unique_tokens .
## num_hrefs 3.092007e+01
## num_self_hrefs -4.685020e+01
## num_imgs 1.641840e+01
## num_videos 1.463544e+01
## average_token_length -2.242787e+02
## num_keywords 7.958360e+00
## data_channel_is_lifestyle -1.455686e+02
## data_channel_is_entertainment -9.495347e+02
## data_channel_is_bus -1.571203e+02
## data_channel_is_socmed .
## data_channel_is_tech -3.980603e+01
## data_channel_is_world .
## kw_min_min 2.709793e+00
## kw_max_min .
## kw_avg_min -9.634882e-02
## kw_min_max -1.722391e-03
## kw_max_max -2.880619e-04
## kw_avg_max .
## kw_min_avg -2.414621e-01
## kw_max_avg -1.310833e-01
## kw_avg_avg 1.269401e+00
## self_reference_min_shares 8.432876e-03
## self_reference_max_shares 2.625916e-03
## self_reference_avg_sharess .
## weekday_is_monday 3.411827e+02
## weekday_is_tuesday -8.906019e+01
## weekday_is_wednesday .
## weekday_is_thursday .
## weekday_is_friday .
## weekday_is_saturday 3.266023e+02
## weekday_is_sunday .
## is_weekend 1.971661e+02
## LDA_00 5.131292e+01
## LDA_01 -6.873845e+00
## LDA_02 -7.454948e+02
## LDA_03 8.267474e+02
## LDA_04 .
## global_subjectivity 2.271835e+03
## global_sentiment_polarity .
## global_rate_positive_words .
## global_rate_negative_words .
## rate_positive_words .
## rate_negative_words .
## avg_positive_polarity -4.850011e+02
## min_positive_polarity -9.764213e+02
## max_positive_polarity .
## avg_negative_polarity -2.842540e+02
## min_negative_polarity -1.444689e+02
## max_negative_polarity -6.330207e+02
## title_subjectivity .
## title_sentiment_polarity .
## abs_title_subjectivity 6.477365e+02
## abs_title_sentiment_polarity 6.781005e+02
#above we are getting coefficients for the best model
Elastic net can also be computed using glmnet directly in R. Here we need to specify an \(\alpha\) value of our choice, and the model computes the best lambda value.
# Find the best lambda using cross-validation
set.seed(202)
cv_elasticnet = cv.glmnet(x = x.train, y=train$shares, alpha=0.8)
lambda.hat = cv_elasticnet$lambda.min
lambda.hat
## [1] 35.36738
# Fit the elastic net regression model on the training data
elasticnet_glm <- glmnet(x.train, y.train, alpha = 0.8, lambda = cv_elasticnet$lambda.min)
# Display regression coefficients
coef(elasticnet_glm)
## 59 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -5.885660e+02
## n_tokens_title 6.110030e+01
## n_tokens_content .
## n_unique_tokens 6.787211e-02
## n_non_stop_words .
## n_non_stop_unique_tokens .
## num_hrefs 3.093699e+01
## num_self_hrefs -4.684245e+01
## num_imgs 1.642880e+01
## num_videos 1.458389e+01
## average_token_length -2.245860e+02
## num_keywords 7.698989e+00
## data_channel_is_lifestyle -1.598475e+02
## data_channel_is_entertainment -9.422223e+02
## data_channel_is_bus -1.625094e+02
## data_channel_is_socmed .
## data_channel_is_tech -5.241918e+01
## data_channel_is_world .
## kw_min_min 2.681249e+00
## kw_max_min .
## kw_avg_min -9.873371e-02
## kw_min_max -1.731488e-03
## kw_max_max -3.063569e-04
## kw_avg_max .
## kw_min_avg -2.456792e-01
## kw_max_avg -1.326012e-01
## kw_avg_avg 1.279411e+00
## self_reference_min_shares 8.445752e-03
## self_reference_max_shares 2.621808e-03
## self_reference_avg_sharess .
## weekday_is_monday 3.417161e+02
## weekday_is_tuesday -8.943150e+01
## weekday_is_wednesday .
## weekday_is_thursday .
## weekday_is_friday .
## weekday_is_saturday 3.262061e+02
## weekday_is_sunday .
## is_weekend 1.981741e+02
## LDA_00 3.526992e+01
## LDA_01 -4.001172e+01
## LDA_02 -7.628426e+02
## LDA_03 7.929370e+02
## LDA_04 .
## global_subjectivity 2.262254e+03
## global_sentiment_polarity .
## global_rate_positive_words .
## global_rate_negative_words .
## rate_positive_words .
## rate_negative_words .
## avg_positive_polarity -4.824727e+02
## min_positive_polarity -9.811835e+02
## max_positive_polarity .
## avg_negative_polarity -3.842968e+02
## min_negative_polarity -1.153100e+02
## max_negative_polarity -5.683147e+02
## title_subjectivity .
## title_sentiment_polarity .
## abs_title_subjectivity 6.486684e+02
## abs_title_sentiment_polarity 6.780768e+02
Next, we will use the elastic net regression model constructed with the train data to make predictions on the test data.
#Predictions on the test data
predictions_elasticnet <- elasticnet %>% predict(x.test)
# Evaluate the model performance using RMSE and R2 metrics
RMSE_1 = min(elasticnet$results$RMSE)
Rsquare_1 = min(elasticnet$results$Rsquared)
data.frame(RMSE = RMSE_1, R2 = Rsquare_1)
## RMSE R2
## 1 9665.946 0.02007253
Performance for all 3 models
As we can see, all 3 models easily generate predictions using the glmnet package in R. In terms of model performance, we can see that the best RMSE generated using the elastic net penalty is lower than the ridge and lasso models. Both ridge and Lasso models have very close RMSE and \(R^2\) values, whereas the elastic net method outperforms them in terms of RMSE. Penalized regression is easy to compute and is most helpful when we have large datasets that are multivariate. It is recommended to compute all 3 models and to proceed with the model specific to the problem and the one that produces the best predictions with the lowest RMSE.
Dimension Reduction
In the methods described so far, we have focused on feature selection and eliminating features that we decide are not significant enough for our model. However, in entirely dropping these features - however insignificant they may be - we completely miss out on any benefits these dropped variables may contribute. Instead, we could incorporate all variables strategically in order to extract as much information as possible using dimension reduction. Dimension reduction entails creating an \(M\)-dimensional subspace where \(M\) combinations are constructed as a combination of \(p\) predictors. The \(M\) combinations of predictors are then used to fit a least squares linear regression model. In order to benefit from this approach, \(M < p\) must hold. If \(M = p\), then the resulting model amounts to using a least squares fit using all of the original predictors. We especially benefit from this approach when \(p\) (number of predictors) is large relative to n (number of observations), as using \(M < p\) can reduce the variance of the model.
Dimension reduction methods generally follow two steps:
1. Obtain the transformed predictors \(Z_1\), \(Z_2\),…\(Z_M\)
2. Fit the model using these predictors
In the following sections, we will discuss two popular methods for dimension reduction: Principal Component Analysis (PCA) and Partial Least Squares (PLS).
Principal Components Analysis (PCA)
Principal Components Analysis is a dimension reduction technique in which variables named principal components are constructed as linear combinations of our original variables. There are as many principal components created as there are original variables in our data. However, in order to attain a model with reduced dimensions, we will aim to drop a number of these principal components. To determine which principal components can be excluded from our model, we will rank the principal components by their importance in predicting the response variable. Then, only the most important principal components are used in the linear model.
We will likely want to use this method when we seek to reduce the number of variables, but do not know which ones. However, we will see that one drawback of this method is its interpretability, as we are no longer using our original predictors to predict the response but are instead using transformed predictors that each include all of our original predictors.
Before we dive into the algorithm and steps behind PCA, it will be helpful to review how and why we use variance to rank principal components by importance.
Constructing Principal Components
Principal components represent the directions of the data that explain a maximal amount of variance. In understanding this method, it’s important to recognize that variance explained is equivalent to information captured in a model. The more variance in the data that we can explain through a feature, the better.
The image below will help explain how this concept ties into principal component analysis.
(Source: Z, Jaadi, “A Step-by-Step Explanation of Principal Component Analysis.)
The goal of principal component analysis is to compress as much information or variance in the first components, and thus the first principal component accounts for the largest variance in the dataset.
In the scatterplot above, we aim to construct a principal component or line that maximizes this variance, which is when the red dots are spread out the most along the line. This line then represents more variance and the larger the variance represented by a line, the larger the dispersion of points along a line, and thus the more information the line holds.
Now that we understand how these principal components contribute to our model, let’s walk through the steps taken to construct the principal components.
Algorithm Behind the Method:
- Center and standardize our original predictor variable matrix, \(X\).
- This method is sensitive to variance in the predictor variables, so this step is necessary to ensure each variable contributes equally to the model
- This method is sensitive to variance in the predictor variables, so this step is necessary to ensure each variable contributes equally to the model
- Compute the covariance matrix by multiplying the transpose of this standardized matrix by itself.
- Thus, where X has been standardized: \(X^TX\)
- Thus, where X has been standardized: \(X^TX\)
- Compute the eigenvectors and eigenvalues of the covariance matrix, \(X^TX\).
- Let’s pause here for a refresher on linear algebra:
- Eigenvectors represent directions and eigenvalues represent magnitude (or in this scenario, importance). Eigenvectors are stored in a dense matrix while eigenvalues are stored in a matrix with eigenvalues on the diagonal and zeros everywhere else. These eigenvalues on the diagonal are associated with the corresponding column in the eigenvector matrix. For example, an eigenvalue stored on the diagonal in the second column of its matrix corresponds to the second column in the eigenvector matrix.
- Luckily, there are functions in R that will compute these values for us, but it’s important to understand the eigenvectors and eigenvalues it outputs.
- Eigenvectors represent directions and eigenvalues represent magnitude (or in this scenario, importance). Eigenvectors are stored in a dense matrix while eigenvalues are stored in a matrix with eigenvalues on the diagonal and zeros everywhere else. These eigenvalues on the diagonal are associated with the corresponding column in the eigenvector matrix. For example, an eigenvalue stored on the diagonal in the second column of its matrix corresponds to the second column in the eigenvector matrix.
- Let’s pause here for a refresher on linear algebra:
- Sort the eigenvalues from largest to smallest
- Since eigenvalues correspond to their eigenvectors, the eigenvectors are similarly sorted according to the order of their corresponding eigenvalue (put another way, the columns of the eigenvector matrix are reordered according to the reordering of the eigenvalue matrix).
- Since eigenvalues correspond to their eigenvectors, the eigenvectors are similarly sorted according to the order of their corresponding eigenvalue (put another way, the columns of the eigenvector matrix are reordered according to the reordering of the eigenvalue matrix).
- Multiply this sorted eigenvector matrix by the standardized version of X from step 1.
- As a result, we now have a standardized version of X with weights determined by the eigenvectors.
- As a result, we now have a standardized version of X with weights determined by the eigenvectors.
- With this new matrix, determine how many features (principal components) to keep in the new feature matrix. This can be done a couple of ways:
- Arbitrarily determine how many dimensions to keep:
- This is not recommended, but at times may be appropriate for your specific analysis or problem
- Set a threshold for proportion of variance explained:
- As explained earlier, the importance of a principal component or eigenvector is tied to variance. Further, each eigenvalue is approximately the importance of its corresponding eigenvalue (as we saw when sorting our eigenvalues and then eigenvectors). Therefore, we can create a metric for measuring the importance or variance explained by each principal component or eigenvector called the proportion of variance, which is the sum of all eigenvalues kept divided by the sum of all eigenvalues.
- With this metric, we can then pick a threshold of proportion of variance explained and include principal components up until we reach the set threshold.
- As explained earlier, the importance of a principal component or eigenvector is tied to variance. Further, each eigenvalue is approximately the importance of its corresponding eigenvalue (as we saw when sorting our eigenvalues and then eigenvectors). Therefore, we can create a metric for measuring the importance or variance explained by each principal component or eigenvector called the proportion of variance, which is the sum of all eigenvalues kept divided by the sum of all eigenvalues.
- Use a scree plot:
- Similar to the method above, we can use the proportion of variance to determine how many principal components to keep.
- We calculate the proportion of variance for each principal component, sort the principal components by proportion of variance explained (though they technically should already be sorted in order of importance), then plot the proportion of variance explained by each principal component.
- We should then have a line plot with an elbow that denotes the point at which there is the largest drop in proportion of variance explained. We then decide to use the principal components up until this drop. In the scree plot below, there is a considerable drop between proportion of variance explained in PC1 to PC2. We would therefore identify PC2 as the elbow and only include the first feature (PC1) and drop the rest. As you can see, a disadvantage of this method is that it is relatively subjective.
- Similar to the method above, we can use the proportion of variance to determine how many principal components to keep.
- Arbitrarily determine how many dimensions to keep:
Use this new feature matrix with the principal components/eigenvectors we chose to keep as the new X in our linear regression model against the untransformed Y – this type of regression using PCA is known as Principal Components Regression (PCR).
- Note: Each of these new variables resulting from PCA are all independent of each other, which is an important assumption that must hold for a linear model.
Code Implementation
There are a couple of libraries you can use for PCR - one is pls which also can be used for PLS (covered next). We’ll first show how to run PCR using this library:
library(pls)
Run PCR with cross-validation:
pcr.fit = pcr(shares~., data = news, scale=TRUE, validation ="CV")
In the above code, scale is set to TRUE in order to standardize each predictor. validation is set to CV in order to compute ten-fold cross-validation error for each possible value of M (the number of principal components used). Having the cross-validation error will give us another metric to judge principal components by.
Summary:
summary(pcr.fit)
## Data: X dimension: 39644 58
## Y dimension: 39644 1
## Fit method: svdpc
## Number of components considered: 58
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 11627 11629 11604 12222 12482 12236 12259
## adjCV 11627 11628 11603 12158 12394 12171 12191
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 12189 12402 12417 12406 12397 12703 13507
## adjCV 12127 12320 12334 12324 12316 12594 13325
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 13491 13551 13529 13485 13596 13556 13435
## adjCV 13309 13363 13343 13303 13405 13367 13257
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## CV 13422 13546 13597 13745 13267 12597 12668
## adjCV 13245 13358 13405 13541 13103 12494 12559
## 28 comps 29 comps 30 comps 31 comps 32 comps 33 comps 34 comps
## CV 12742 12697 12108 12096 12140 12846 12849
## adjCV 12626 12585 12051 12040 12080 12721 12723
## 35 comps 36 comps 37 comps 38 comps 39 comps 40 comps 41 comps
## CV 12929 12418 13047 12271 12332 12322 12354
## adjCV 12796 12331 12902 12198 12253 12244 12273
## 42 comps 43 comps 44 comps 45 comps 46 comps 47 comps 48 comps
## CV 11531 11534 11579 11579 11570 11573 11535
## adjCV 11530 11533 11573 11573 11565 11569 11533
## 49 comps 50 comps 51 comps 52 comps 53 comps 54 comps 55 comps
## CV 11595 11569 12045 11909 11881 15445 11519
## adjCV 11587 11563 11991 11869 11844 15098 11517
## 56 comps 57 comps 58 comps
## CV 1.407e+12 1.411e+12 1.411e+12
## adjCV 1.334e+12 1.339e+12 1.339e+12
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 8.27073 15.1673 21.356 26.531 31.220 35.647 39.828 43.76
## shares 0.09982 0.4832 1.129 1.129 1.226 1.228 1.245 1.28
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 47.39 50.908 54.359 57.60 60.380 62.757 65.107
## shares 1.28 1.331 1.331 1.34 1.456 1.587 1.615
## 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps 22 comps
## X 67.237 69.355 71.443 73.498 75.520 77.467 79.338
## shares 1.623 1.624 1.634 1.647 1.664 1.664 1.664
## 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps 29 comps
## X 81.14 82.701 84.202 85.594 86.838 88.005 89.131
## shares 1.67 1.672 1.739 1.756 1.763 1.763 1.785
## 30 comps 31 comps 32 comps 33 comps 34 comps 35 comps 36 comps
## X 90.185 91.174 92.136 93.058 93.869 94.663 95.421
## shares 1.785 1.836 1.856 1.858 1.922 1.922 1.943
## 37 comps 38 comps 39 comps 40 comps 41 comps 42 comps 43 comps
## X 96.050 96.6 97.118 97.539 97.954 98.337 98.667
## shares 1.964 2.0 2.002 2.004 2.006 2.009 2.026
## 44 comps 45 comps 46 comps 47 comps 48 comps 49 comps 50 comps
## X 98.983 99.206 99.375 99.513 99.643 99.745 99.836
## shares 2.028 2.029 2.029 2.032 2.195 2.198 2.206
## 51 comps 52 comps 53 comps 54 comps 55 comps 56 comps 57 comps
## X 99.901 99.955 99.999 100.000 100.00 100.00 100.00
## shares 2.284 2.285 2.293 2.301 2.31 2.31 2.31
## 58 comps
## X 100.00
## shares 2.31
From this summary, we see that the smallest RMSE is at 55 components (M=55), at which point 100% of variance is explained.
More info on the pls package is available here.
Next, we can use the function prcomp() within the core stats package for PCR. The difference in setting up pcr() vs. prcomp() is that prcomp() needs the data passed in with no response variable. Therefore, we create an X from our data, removing our response variable, shares.
X = news %>% dplyr::select(-shares)
prcomp.fit = prcomp(X, scale = TRUE)
summary(prcomp.fit)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.19021 2.00001 1.89462 1.73250 1.64898 1.60249 1.55731
## Proportion of Variance 0.08271 0.06897 0.06189 0.05175 0.04688 0.04428 0.04181
## Cumulative Proportion 0.08271 0.15167 0.21356 0.26531 0.31220 0.35647 0.39828
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.50916 1.45274 1.42758 1.41471 1.37207 1.26879 1.17421
## Proportion of Variance 0.03927 0.03639 0.03514 0.03451 0.03246 0.02776 0.02377
## Cumulative Proportion 0.43755 0.47394 0.50908 0.54359 0.57604 0.60380 0.62757
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.1675 1.1115 1.10839 1.10049 1.09180 1.08293 1.06248
## Proportion of Variance 0.0235 0.0213 0.02118 0.02088 0.02055 0.02022 0.01946
## Cumulative Proportion 0.6511 0.6724 0.69355 0.71443 0.73498 0.75520 0.77467
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 1.04181 1.02235 0.95146 0.93317 0.89832 0.84948 0.82282
## Proportion of Variance 0.01871 0.01802 0.01561 0.01501 0.01391 0.01244 0.01167
## Cumulative Proportion 0.79338 0.81140 0.82701 0.84202 0.85594 0.86838 0.88005
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.80809 0.78173 0.75746 0.74714 0.73103 0.68603 0.67857
## Proportion of Variance 0.01126 0.01054 0.00989 0.00962 0.00921 0.00811 0.00794
## Cumulative Proportion 0.89131 0.90185 0.91174 0.92136 0.93058 0.93869 0.94663
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.66318 0.60379 0.5647 0.54828 0.49432 0.49033 0.47134
## Proportion of Variance 0.00758 0.00629 0.0055 0.00518 0.00421 0.00415 0.00383
## Cumulative Proportion 0.95421 0.96050 0.9660 0.97118 0.97539 0.97954 0.98337
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.4377 0.42836 0.35967 0.31239 0.28277 0.2748 0.24416
## Proportion of Variance 0.0033 0.00316 0.00223 0.00168 0.00138 0.0013 0.00103
## Cumulative Proportion 0.9867 0.98983 0.99206 0.99375 0.99513 0.9964 0.99745
## PC50 PC51 PC52 PC53 PC54 PC55
## Standard deviation 0.22955 0.19366 0.17714 0.16042 0.01593 0.006814
## Proportion of Variance 0.00091 0.00065 0.00054 0.00044 0.00000 0.000000
## Cumulative Proportion 0.99836 0.99901 0.99955 0.99999 1.00000 1.000000
## PC56 PC57 PC58
## Standard deviation 1.554e-05 5.273e-15 4.019e-15
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00
Similar to pcr(), we have a summary table with all PCs. Instead, here we have Proportion of Variance and Cumulative Proportion available to us.
Next, visualize eigenvalues (scree plot). We will use fviz_eig located within the factoextra package. Show the percentage of variances explained by each principal component. 58 is passed in for ncp in order to visualize all PCs. If you would like to view less, you can change this input.
library(factoextra)
fviz_eig(prcomp.fit, ncp = 58)
At this point, we could have either chosen a threshold for proportion of variance explained or rely on finding an elbow in the scree plot. Let’s say we chose a threshold of 85% proportion of variance explained. We would then see that we need to add components up until PC26 (including PC26) in order to meet this threshold, and would therefore keep the PCs up until and including PC26 in the model. In contrast, if we were to use the scree plot to find the elbow, there is arguably a large drop around PC5, at which point only 31% of variance is explained. Even if we were to continue until the next drop and elbow at PC12, only 58% of variance is explained.
More info on the prcomp is available here.
More info on the factoextra package is available here.
Partial Least Squares (PLS)
Partial least squares (PLS) is very similar to principal components regression (PCR), in that both generate a new set of features that are linear combinations of the original features. However, unlike PCR, PLS incorporates information from the response variable when creating these new features. In doing so, PLS attempts to identify directions that explain the most variance in both the features and the response, rather than just explaining the variance in the features. Because PCR only considers the variance in the features, it may not be best for prediction, particularly if there is some variance in the features unrelated to the variance in the response.
PLS can be thought of as a supervised alternative to PCR’s unsupervised approach. This is not to say that PCR cannot be used for supervised learning problems (it can be used for either supervised or unsupervised problems), but instead describes the method of generating the new features. PLS, on the other hand, can only be used for supervised problems, since the response variable is used in the creation of the new features.
Algorithm
PLS generates new features \(Z\) that are linear combinations of the original features \(X\). This means that the \(m\)th new feature can be written as \(Z_m = \sum_{j = 1}^p \phi_{jm}X_j\), where \(p\) is the total number of original features.
This process for creating these new features is as follows:
Standardize all features.
Initialize \(X^{(0)} = X\) and \(m=1\).
Compute the first direction \(Z_1\) by setting \(\phi_{j1}\) equal to the coefficient of ordinary least squares linear regression between \(Y\) and \(X_j\). In doing so, the \(X_j\)’s most highly correlated with \(Y\) receive the highest weights in \(Z_1\).
Orthogonalize \(X\) with respect to \(Z_1\) by regressing each variable \(X_j\) on \(Z_1\) and taking the residuals. Denote these orthogonalized values \(X^{(1)}\). This means that \(X^{(1)}\) is essentially the information not already explained by the first direction.
Repeat steps 2-4 to calculate each \(Z_m\) for \(m = 2, ..., p\), using the orthogonalized \(X^{(m-1)}\) output from the previous step instead of \(X\) to generate \(Z_m\). So \(X^{(1)}\) is used to calculate \(Z_2\), \(X^{(2)}\) is used to calculate \(Z_3\), and so on.
Use ordinary least squares to regress \(Y\) on \(Z_1, ..., Z_m\) for some \(m \le p\). Generally use cross-validation to choose the best value of \(m\).
Code Implementation
To implement PLS in R, we can use the pls package. After loading the package, the basic syntax for fitting PLS regression is very similar to that of linear regression with the lm function, though in this case we will use the plsr function. The first argument is a formula to tell the function what the response and predictor variables are. In this case, we’re using online news data with shares as the response and all other variables as predictors. To avoid having to type out 58 variables in the formula, we can instead use shares ~ . to indicate that all columns besides shares should be treated as predictor variables. Additionally, to standardize the predictors, we can specify scale = TRUE and center = TRUE.
#load the pls library
library(pls)
#fit the pls regression
pls_fit <- plsr(shares ~ ., data = news, scale = TRUE, center = TRUE)
summary(pls_fit)
## Data: X dimension: 39644 58
## Y dimension: 39644 1
## Fit method: kernelpls
## Number of components considered: 58
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 6.381 10.905 16.268 19.975 24.040 28.063 30.560 33.075
## shares 1.624 1.896 2.001 2.075 2.121 2.154 2.202 2.238
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 35.176 37.555 39.87 42.113 45.369 47.846 50.660
## shares 2.263 2.274 2.28 2.283 2.285 2.287 2.289
## 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps 22 comps
## X 53.01 54.760 57.655 59.601 62.131 63.598 65.113
## shares 2.29 2.292 2.292 2.293 2.293 2.293 2.293
## 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps 29 comps
## X 67.478 69.166 70.471 71.769 73.544 75.906 77.257
## shares 2.293 2.293 2.293 2.293 2.293 2.293 2.294
## 30 comps 31 comps 32 comps 33 comps 34 comps 35 comps 36 comps
## X 78.732 79.454 80.527 81.817 82.824 83.744 84.837
## shares 2.294 2.294 2.294 2.295 2.296 2.297 2.298
## 37 comps 38 comps 39 comps 40 comps 41 comps 42 comps 43 comps
## X 85.578 86.978 87.800 88.806 90.166 90.601 92.263
## shares 2.301 2.303 2.304 2.304 2.304 2.304 2.304
## 44 comps 45 comps 46 comps 47 comps 48 comps 49 comps 50 comps
## X 93.097 94.168 95.511 96.111 96.761 97.366 97.875
## shares 2.304 2.304 2.304 2.304 2.304 2.304 2.304
## 51 comps 52 comps 53 comps 54 comps 55 comps 56 comps 57 comps
## X 98.610 99.366 99.606 99.922 100.00 100.00 100.00
## shares 2.304 2.305 2.305 2.306 2.31 2.31 2.31
## 58 comps
## X 100.00
## shares 2.31
The summary of the fit shows the percentage of variance explained by each additional component. In this example, we see that five components explain 24% of the variance in the data, and 10 components explain 38% of the variance in the data.
We can also make predictions for new data using our PLS model. As with fitting the model, this process is very similar to ordinary least squares regression. We will use the predict function, but will need to include an additional parameter ncomp to specify the number of components to include.
More info on the pls package is available here.
#predict for given number of components (3 in this case)
predict(pls_fit, ncomp = 3, newdata = news[39640:39644,])
## , , 3 comps
##
## shares
## 39640 3631.6429
## 39641 5093.6567
## 39642 5594.3732
## 39643 799.7493
## 39644 1139.6305
While the percentage of variance explained gives us some information about how many components to include, best practice is to choose this number using cross-validation. One method of cross-validation would involve a double for loop to iterate through subsets of the data and number of components, making predictions and calculating an error measure (generally root mean squared error for numeric predictor variables) for each combination, then choosing the number of components with the smallest RMSE. However, the caret package can simplify this process considerably using the train function. We need to include the same parameters as in the function call to plsr, along with a few additional parameters. We specify method = "pcr" since train can handle many different model types, trControl = trainControl("cv", number = 10) to indicate that we want 10-fold cross-validation, and tuneLength = 10 to indicate that we want to test values from 1 to 10 for the ncomp parameter.
library(caret)
#set seed for reproducibility, since CV involves random partitions
set.seed(300)
pls_cv <- train(shares ~ ., data = news, scale = TRUE, center = TRUE,
method = "pcr", trControl = trainControl("cv", number = 10),
tuneLength = 10)
plot(pls_cv)
pls_cv$bestTune
## ncomp
## 2 2
From cross-validation, we see that two components is the best choice, and that including more than that may lead to overfitting.
More info on the caret package can be found here.
Pros & Cons
PLS can perform well for prediction when there are many features, even when multicollinearity is present, violating the assumptions for ordinary least squares (OLS). Multicollinearity occurs when some predictor variables are highly correlated with some combination of other predictor variables, and occurs increasingly often as the number of predictor variables grows. Therefore PLS can be useful for prediction with high-dimensional data (when the number of predictors \(p\) is large relative to the number of observations \(n\)). However, PLS is less interpretable than OLS in understanding the underlying relationship between features and response, since it involves the creation of new features.
PLS generally has lower bias but higher variance than PCR.
PLS can have minor instability in its estimates because it tends to shrink low-variance directions but sometimes inflates high-variance directions. Ridge regression shrinks all directions, but especially low-variance directions, while PCR discards low-variance directions.
PLS, PCR, and ridge regression typically perform very similarly, but for minimizing prediction error, ridge regression is generally preferred because it shrinks smoothly as opposed to in discrete steps.
Conclusion
In summary, it is important to consider the number of predictors being used to fit your model in OLS. Commonly, not all predictors are necessary and instead can lead to overfitting. We have outlined a few methods for either removing or transforming these predictors and unfortunately, one method is not always going to be the best method - it depends on the context of your data and the problem. Ridge does tend to most frequently outperform the others, but it’s important to use discretion and due diligence to test or consider a couple methods, and then perhaps compare the resulting models using MSE.
We hope you’ve enjoyed our tutorial!
References
[1] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R. New York: Springer, 2013.
[2] T. Hastie, J. Friedman, and R. Tisbshirani, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York: Springer, 2009.
[3] R. Merrett, “Online News Popularity.” Available: https://code.datasciencedojo.com/datasciencedojo/datasets/tree/master/Online News Popularity [Accessed: 03-Dec-2020].
[4] M. Brems, “A One-Stop Shop for Principal Component Analysis.” Available: https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c. [Accessed: 04-Dec-2020].
[5] D. Montgomery, E. Peck, and G. Vining, Introduction To Linear Regression Analysis. 5th ed. Wiley, 2012.
[6] Z, Jaadi, “A Step-by-Step Explanation of Principal Component Analysis.” Available: https://builtin.com/data-science/step-step-explanation-principal-component-analysis. [Accessed: 05-Dec-2020].
[7] R. Tobias, “An Introduction to Partial Least Squares Regression.” Available: https://stats.idre.ucla.edu/wp-content/uploads/2016/02/pls.pdf. [Accessed: 03-Dec-2020].
[8] Kassambara, “Principal Component and Partial Least Squares Regression Essentials,” STHDA, 11-Mar-2018. [Online]. Available: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/152-principal-component-and-partial-least-squares-regression-essentials/. [Accessed: 07-Dec-2020].
[9] B.-H. Mevik and R. Wehrens, “Introduction to the pls Package,” 04-Aug-2020. [Online]. Available: https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf. [Accessed: 07-Dec-2020].
[10] M. Kuhn, “The caret Package,” 27-Mar-2019. [Online]. Available: http://topepo.github.io/caret/index.html. [Accessed: 07-Dec-2020].
[11] M. Porter, “Data Mining Lecture .” Data Mining Lecture . 10-Sept-2020, Charlottesville. https://mdporter.github.io/SYS6018/lectures/03-penalized.pdf
[12] J. Friedman et al., glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. 2020. https://CRAN.R-project.org/package=glmnet
[13] M. Porter, “Supervised Learning II Lecture .” 01-Sept-2020, Charlottesville. https://mdporter.github.io/SYS6018/lectures/03-penalized.pdf
[14] G. Elumalai, “Pros and cons of common Machine Learning algorithms,” Medium, Nov. 19, 2019. https://medium.com/@gokul.elumalai05/pros-and-cons-of-common-machine-learning-algorithms-45e05423264f [Accessed 07-Dec-2020].
[15] “Penalized Regression Essentials: Ridge, Lasso & Elastic Net - Articles - STHDA.” http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/153-penalized-regression-essentials-ridge-lasso-elastic-net/ [Accessed 07-Dec-2020].
[16] A. Kassambara et al., factoextra: Extract and Visualize the Results of Multivariate Data Analyses. 2020. https://CRAN.R-project.org/package=factoextra